Can You Explain That? Lucid Explanations Help Human-AI Collaborative Image Retrieval
While there have been many proposals for making AI algorithms explainable, few
have attempted to evaluate the impact of AI-generated explanations on human
performance in human-AI collaborative tasks. To bridge this gap, we
propose a Twenty-Questions style collaborative image retrieval game,
Explanation-assisted Guess Which (ExAG), as a method of evaluating the efficacy
of explanations (visual evidence or textual justification) in the context of
Visual Question Answering (VQA). In ExAG, a human user must guess a secret
image picked by the VQA agent by asking the agent natural-language questions.
We show that, overall, when the AI explains its answers, users
succeed more often in guessing the secret image correctly. Notably, a few
correct explanations can readily improve human performance when VQA answers are
mostly incorrect, compared with no-explanation games. Furthermore, we show
that while explanations rated as "helpful" significantly improve human
performance, "incorrect" and "unhelpful" explanations can degrade performance
as compared to no-explanation games. Our experiments, therefore, demonstrate
that ExAG is an effective means to evaluate the efficacy of AI-generated
explanations in a human-AI collaborative task.
Comment: 2019 AAAI Conference on Human Computation and Crowdsourcing
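To make the game protocol concrete, here is a minimal Python sketch of an ExAG-style loop; the `VQAAgent` class, its `answer` method, and the random stand-in for the human's guess are hypothetical placeholders for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the ExAG (Explanation-assisted Guess Which) protocol:
# the user questions a VQA agent about a secret image and then guesses which
# image the agent was answering about. VQAAgent is a stand-in, not the paper's API.
import random
from typing import List, Tuple


class VQAAgent:
    """Placeholder VQA model that answers questions about a single image."""

    def answer(self, image: str, question: str) -> Tuple[str, str]:
        # A real agent would run a VQA model and return visual evidence
        # or a textual justification alongside its answer.
        return "yes", "attention map highlights the region around the dog"


def play_exag(images: List[str], agent: VQAAgent, questions: List[str],
              with_explanations: bool = True) -> bool:
    secret = random.choice(images)        # the agent's hidden target image
    for q in questions:                   # user poses natural-language questions
        ans, expl = agent.answer(secret, q)
        print(f"Q: {q}  A: {ans}")
        if with_explanations:             # explanation-assisted condition
            print(f"   why: {expl}")
    guess = random.choice(images)         # stand-in for the human's guess
    return guess == secret                # success = secret image identified
```

Comparing success rates with `with_explanations` switched on and off mirrors the paper's comparison between explanation and no-explanation games.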
The Impact of Explanations on AI Competency Prediction in VQA
Explainability is one of the key elements for building trust in AI systems.
Despite numerous attempts to make AI explainable, quantifying the effect of
explanations on human-AI collaborative tasks remains a challenge.
Aside from the ability to predict the overall behavior of AI, in many
applications, users need to understand an AI agent's competency in different
aspects of the task domain. In this paper, we evaluate the impact of
explanations on the user's mental model of AI agent competency within the task
of visual question answering (VQA). We quantify users' understanding of
competency based on the correlation between actual system performance and
user rankings. We introduce an explainable VQA system that uses spatial and
object features and is powered by the BERT language model. Each group of users
sees only one kind of explanation when ranking the competencies of the VQA model.
The proposed model is evaluated through between-subject experiments to probe
explanations' impact on the user's perception of competency. The comparison
between the two VQA models shows that BERT-based explanations and the use of
object features improve the user's prediction of the model's competencies.
Comment: Submitted to HCCAI 202
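The competency measure described above, the correlation between actual system performance and user rankings, can be sketched as a rank correlation; the choice of Spearman's rho and the category names and numbers below are assumptions for illustration, not the paper's exact protocol.

```python
# Hypothetical sketch: quantify a user's mental model of VQA competency as the
# rank correlation between the user's ranking and the model's measured accuracy
# per question category. Categories and values are illustrative assumptions.
from scipy.stats import spearmanr

# Measured model accuracy per question category (hypothetical values).
actual_accuracy = {"counting": 0.41, "color": 0.78, "spatial": 0.55, "object": 0.83}

# One user's ranking of the model's competencies, best (1) to worst (4).
user_ranking = {"object": 1, "color": 2, "spatial": 3, "counting": 4}

categories = sorted(actual_accuracy)
# Higher accuracy should correspond to a better (numerically smaller) rank,
# so negate the rank before correlating.
rho, p = spearmanr([actual_accuracy[c] for c in categories],
                   [-user_ranking[c] for c in categories])
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")  # rho near 1 = accurate mental model
```

A rho near 1 would indicate that the user's ranking tracks the model's true per-category performance, i.e. a well-calibrated mental model of competency.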
Learning Invariant World State Representations with Predictive Coding
Self-supervised learning methods overcome the key bottleneck for building
more capable AI: the limited availability of labeled data. However, one of the
drawbacks of self-supervised architectures is that the representations that
they learn are implicit and it is hard to extract meaningful information about
the encoded world states, such as the 3D structure of the visual scene encoded in a
depth map. Moreover, in the visual domain, such representations rarely
undergo evaluations that may be critical for downstream tasks, such as vision
for autonomous cars. Herein, we propose a framework for evaluating visual
representations for illumination invariance in the context of depth perception.
We develop a hybrid fully-supervised/self-supervised learning method and a
novel architecture that extends the predictive coding approach: the PRedictive
Lateral bottom-Up and top-Down Encoder-decoder Network (PreludeNet), which explicitly
learns to infer and predict depth from video frames. In PreludeNet, the
encoder's stack of predictive coding layers is trained in a self-supervised
manner, while the predictive decoder is trained in a supervised manner to infer
or predict the depth. We evaluate the robustness of our model on a new
synthetic dataset, in which lighting conditions (such as overall illumination,
and the effect of shadows) can be parametrically adjusted while keeping all
other aspects of the world constant. PreludeNet achieves both competitive depth
inference performance and next frame prediction accuracy. We also show how this
new network architecture, coupled with the hybrid
fully-supervised/self-supervised learning method, achieves a balance between
this performance and invariance to changes in lighting. The proposed framework
for evaluating visual representations can be extended to diverse task domains
and invariance tests.
Comment: 11 pages, 5 figures, submitted
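A minimal PyTorch-style sketch of the hybrid scheme described above, with the encoder updated by a self-supervised next-frame loss and the decoder by a supervised depth loss; all module internals, shapes, and hyperparameters are illustrative assumptions, not the actual PreludeNet.

```python
# Hypothetical sketch of hybrid fully-supervised/self-supervised training:
# the encoder learns via self-supervised next-frame prediction, the decoder
# via supervised depth regression. Not the actual PreludeNet layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Stand-in for the stack of predictive coding layers."""

    def __init__(self):
        super().__init__()
        self.features = nn.Conv2d(3, 16, 3, padding=1)
        self.frame_head = nn.Conv2d(16, 3, 3, padding=1)  # next-frame prediction

    def forward(self, frame):
        feats = torch.relu(self.features(frame))
        return feats, self.frame_head(feats)


class DepthDecoder(nn.Module):
    """Stand-in for the predictive decoder that infers depth."""

    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(16, 1, 3, padding=1)  # one-channel depth map

    def forward(self, feats):
        return self.head(feats)


encoder, decoder = Encoder(), DepthDecoder()
opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_dec = torch.optim.Adam(decoder.parameters(), lr=1e-4)

# Dummy batch of (frame_t, frame_t+1, ground-truth depth); shapes are assumptions.
loader = [(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
           torch.rand(2, 1, 64, 64))]

for frame, next_frame, depth in loader:
    feats, frame_pred = encoder(frame)
    # Self-supervised step: next-frame prediction error updates the encoder.
    loss_ss = F.mse_loss(frame_pred, next_frame)
    opt_enc.zero_grad()
    loss_ss.backward()
    opt_enc.step()
    # Supervised step: depth error updates the decoder only (features detached).
    depth_pred = decoder(feats.detach())
    loss_sup = F.mse_loss(depth_pred, depth)
    opt_dec.zero_grad()
    loss_sup.backward()
    opt_dec.step()
```

Training and evaluating on renders where only illumination varies, as in the proposed synthetic dataset, would then probe how much the inferred depth changes under lighting shifts.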